
Conversation


@mludvig mludvig commented Jun 10, 2025

I was trying to generate a testset from KnowledgeGraph(nodes: 1861, relationships: 912103) and it would run for two days straight without any progress. So I did some profiling on a smaller KG to see what was going on, and found that most of the time was being spent in find_indirect_clusters().

Here is an optimised implementation that seems to produce similar results but is way faster. On my reduced-size KG I'm down from 5 min to a fraction of a second, and on the full KG from never finishing to a few seconds.
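The diff itself isn't shown in this thread, but going by the description later in the thread (the fix replaces the exhaustive DFS path enumeration with a BFS that grows disjoint clusters from each unvisited node), a minimal pure-Python sketch of that idea might look like the following. All names here are illustrative, not the actual Ragas API:

```python
from collections import deque

def find_clusters_bfs(adjacency, depth_limit=3):
    """Group nodes into disjoint clusters via BFS, bounded by depth_limit.

    Each node is visited at most once, so the whole pass is roughly
    O(nodes + edges), instead of enumerating every path the way an
    exhaustive DFS would.
    """
    clusters = []
    visited = set()
    for start in adjacency:
        if start in visited:
            continue
        cluster = {start}
        visited.add(start)
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth >= depth_limit:
                continue
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    cluster.add(neighbor)
                    queue.append((neighbor, depth + 1))
        # Only keep clusters with at least one relationship.
        if len(cluster) > 1:
            clusters.append(cluster)
    return clusters

# Tiny example graph with two disconnected components.
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}
print(find_clusters_bfs(graph))  # two disjoint clusters
```

Because each node lands in at most one cluster, this trades away the original behaviour of finding every overlapping cluster in exchange for a single linear pass over the graph.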

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 10, 2025
@Lukevanl

Ran into the exact same issue and indeed made a similar fix. It would be nice to get this merged, since the current code is only usable on very small graphs :)

@anistark
Contributor

anistark commented Aug 28, 2025

Thanks for the PR @mludvig 🙌🏼 It does seemingly fix a critical issue.

Could you please add a test as well? That way we can validate this works, and it can also serve as our north star going forward.

Also, please rebase and check on the merge conflict.

@jjmachan
Member

Hey @mludvig! 👋

Hope you're doing well! I really loved your work on reducing find_indirect_clusters() complexity from O(exp(n)) to O(n) - it's exactly the kind of thoughtful improvement that makes Ragas better for everyone.

Quick question for you - we're trying to figure out what to do with the Testset Generation module as we gear up for v0.4, and since you've been working in this space, I'd love to get your take on it.

Mind checking out this discussion when you have a moment?
🔗 #2231

Basically we're wondering if we should keep it as part of the core library, spin it off into its own thing, or maybe even retire it if folks aren't really using it much. No pressure at all, but given your experience with algorithm optimization and performance improvements, your perspective would be super helpful!

Just drop a 👍 👎 or 🚀 on the issue, or feel free to share any thoughts you have.

Thanks for being awesome! 🙏

@anistark
Contributor

anistark commented Sep 1, 2025

Hi @mludvig

Have you also checked the difference in results between the old and new approaches?

I'd recommend adding a new function rather than replacing the existing one, so existing users don't break, while still providing the alternate hop-based algorithm you've suggested.

anistark pushed a commit that referenced this pull request Sep 8, 2025
…n and sampling (#2144)

Like @mludvig found in #2071, I noticed that find_indirect_clusters' use
of exhaustive depth-first search is a significant barrier to use on
any even moderately substantial knowledge graph. That PR uses BFS to
identify a set of disjoint clusters involving the source node (each node
appears in at most one cluster) whereas the original
find_indirect_clusters identifies all sets of clusters up to length
depth_limit from each node. @mludvig, if I'm out of line here, please
correct me!

The approach in this PR instead identifies neighborhoods in the graph
using a Leiden clustering algorithm and samples from the neighborhoods.
I believe this to be a better approach - in my testing it is even faster
than BFS, and the function returns something more in line with the
original `find_indirect_clusters` implementation.

I would have preferred (and, in fact, originally tried) using a Clique
Percolation Method approach because it allows nodes to belong to
multiple neighborhoods; however CPM also ends up running into NP-hard /
NP-complete runtime issues.

This PR does add two dependencies to Ragas - networkx and
scikit-network.
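The "detect neighborhoods, then sample from them" idea can be sketched without the scikit-network dependency. The following pure-Python illustration substitutes a simple label-propagation pass for the actual Leiden algorithm used in the PR; all function names and parameters here are hypothetical stand-ins, not the merged implementation:

```python
import random

def label_propagation(adjacency, rounds=20, seed=0):
    """Crude community detection: each node repeatedly adopts the most
    common label among its neighbors. A toy stand-in for Leiden clustering.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}
    nodes = list(adjacency)
    for _ in range(rounds):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            neighbors = adjacency.get(node, ())
            if not neighbors:
                continue
            counts = {}
            for nb in neighbors:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts, key=counts.get)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break  # converged
    # Group nodes that ended up with the same label.
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return list(communities.values())

def sample_clusters(communities, k=2, seed=0):
    """Sample up to k node-sets from the detected neighborhoods."""
    rng = random.Random(seed)
    return rng.sample(communities, min(k, len(communities)))

# Example: two disconnected triangles form two neighborhoods.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
neighborhoods = label_propagation(graph)
picked = sample_clusters(neighborhoods, k=1)
```

The appeal of the clustering-then-sampling shape is that the expensive step runs once over the whole graph, after which drawing clusters for testset generation is cheap.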
@anistark
Contributor

anistark commented Sep 9, 2025

Closing this as #2144 covered the changes.

@anistark anistark closed this Sep 9, 2025